library(tidyverse)
library(plotly)
library(gridExtra)
library(reshape2)
library(DT)
data <- read.csv("data.csv", encoding = 'UTF-8') %>% filter(year <= 2020)
data <- data %>%
select(id, name, artists, year, release_date, everything())
The catchphrase “Data is the new currency” has become quite prevalent in our day-to-day conversations and news feeds. According to a Statista report released in May 2020, the world is expected to consume 74 zettabytes (1 zettabyte = 1 trillion gigabytes) of data this year. This number is projected to almost double by 2024.
Statista report on data creation/consumption
Thanks to groundbreaking inventions in data storage systems, we can find data on almost any subject these days, ranging from the absurd (a squirrel census in Central Park) to the arcane (dermatoscopic images) or even the personal (WhatsApp chats and location history). With this kind of data accessibility, we can learn about unfamiliar subjects through the unconventional and fascinating medium of data analysis.
Having said that, analyzing data requires understanding the definitions of the data attributes, assessing the quality of the data, and working methodically to avoid drawing incorrect conclusions. Moreover, to avoid getting stuck in a state of analysis paralysis, we should have a sense of the questions we would like to answer through the analysis.
Exploratory data analysis (EDA) is an approach in data science for deriving insights from the given data while understanding the structure, anomalies, and interplay of its various attributes. The insights from EDA can be further distilled to help stakeholders make data-driven decisions and to develop and validate better prediction models.
As an avid listener of Indian classical and pop music, I attempt to expand my musical taste buds through EDA on world music data in this article. I will be using data from Spotify - a major audio streaming service.
Some of the questions of interest are:
Let’s first get an overview of the Spotify data and assess its quality.
I have obtained all Spotify tracks released between 1920 and 2020 (100 years!) from a Kaggle dataset. Each row ideally represents a track, with a Spotify track id, the artist’s name, the track’s name, and a set of audio features associated with the track. A sample of the data from the year 2020 is shown below.
datatable(
head(
data %>%
filter(year==2020)
),
rownames = FALSE,
options = list(dom = 'tp',
pageLength = 5,
scrollX = TRUE)
)
We have a few categorical fields in the data, like id, name, artists, release_date, and key, that can take a finite number of values. We also have many numerical fields, mostly the audio features such as acousticness and danceability, that can take any value within a defined range. Notice that two fields, mode and explicit, are logical/binary in nature.
We now need to understand the definition of each field before using it in our analyses. More specifically, I was interested in the definitions of audio features like loudness, energy, danceability, and popularity. I was quite intrigued by the definition of the popularity measure, which is derived from a Spotify algorithm that relies on the number of plays a track has had and how recent those plays are. The table below has the definitions of all the fields.
# One definition per column, in the same order as names(data)
data_dict <- tibble("field" = names(data),
"definition" = c("Track unique id",
"Track name",
"Artist name",
"Year the track was released",
"Date the track was released",
"A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.",
"Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.",
"Duration of the song in milliseconds",
"Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.",
"If the track contains explicit content",
"Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.",
"The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C#/Db, 2 = D, and so on. If no key was detected, the value is -1.",
"Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.",
"The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.",
"Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.",
"The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity is calculated by an algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past.",
"Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.",
"The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.",
"A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)."))
datatable(
data_dict,
rownames = FALSE,
options = list(dom = 'tp',
pageLength = 10,
scrollX = TRUE),
caption = "Sourced from https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md"
)
There are a couple of issues with this data that make it inconsistent with the ideal state, i.e. one record per track. Some records are duplicated across all fields, sometimes multiple times. In addition, a few tracks appear multiple times with the same track name and artist but different track ids. This happens because a track can be added to Spotify as a single and/or as part of an album, with a separate id for each addition. The same track can also be added multiple times with different release dates under different licenses in different markets. We need to resolve this issue so that we do not double count tracks in the statistics, which could lead to misleading conclusions.
One way to tackle this issue is to define a track as a combination of an artist and a track name and keep the entry that was released the earliest. For example, the track Rain On Me (with Ariana Grande) has two entries in the data, with release dates of May 22 2020 and May 29 2020, from which we would choose the one released on May 22 2020. By following this logic, we avoid favoring the popularity measure (which is correlated with the recent number of plays) for any track that has the same track + artist combination with releases across various years and varying popularity.
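As a minimal sketch of the keep-the-earliest-release rule described above, the toy example below uses made-up rows (not records from the Kaggle dataset), and the name `toy_tracks` is hypothetical. It assumes release dates are ISO-formatted strings, so that sorting them as text matches chronological order. Note that the earliest release does not necessarily carry the smallest id.

```r
library(dplyr)

# Made-up rows for illustration only; ids and artists are not from the real data
toy_tracks <- tibble(
  id           = c("b2", "a1", "c3", "c3"),
  name         = c("Rain On Me", "Rain On Me", "Yellow", "Yellow"),
  artists      = c("['Lady Gaga']", "['Lady Gaga']", "['Coldplay']", "['Coldplay']"),
  release_date = c("2020-05-22", "2020-05-29", "2000-07-10", "2000-07-10")
)

deduped <- toy_tracks %>%
  distinct() %>%                          # drop rows duplicated in every field
  arrange(name, artists, release_date) %>%
  group_by(name, artists) %>%
  slice_head(n = 1) %>%                   # keep the earliest release per track
  ungroup()

deduped
```

Here the fully duplicated Yellow row collapses to one record, and the May 22 entry of Rain On Me is kept even though its id sorts after the May 29 entry.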
After cleaning the data, we now have a total of 158,581 tracks, a reduction of ~14,000 duplicate records.
datatable(
data %>%
filter(name=="Rain On Me (with Ariana Grande)"),
rownames = FALSE,
options = list(dom = 'tp',
pageLength = 5,
scrollX = TRUE)
)
key_map <- rev(c("0" = "C", "1" = "C#", "2" = "D", "3" = "D#", "4" = "E", "5" = "F", "6" = "F#", "7" = "G", "8" = "G#", "9" = "A", "10" = "A#", "11" = "B"))
data_prc <- data %>%
mutate(id = as.character(id)) %>%
#select(-id, -release_date) %>% # duplicate records for different ids and release date
distinct() %>%
# the same song by the same artist can have multiple records with slight changes in audio features and year
arrange(name, artists, year, release_date) %>%
group_by(name, artists) %>%
slice_head(n = 1) %>% # keep the earliest release per name + artists (rows were sorted by year and release_date above)
#filter(popularity == max(popularity)) %>%
ungroup() %>%
mutate(popularity_category = ifelse(popularity >= 80, "80+", "<80"),
valence_bin = cut(valence, seq(0,1,0.1), right = FALSE),
duration_min = duration_ms/(1000*60),
mode_type = case_when(mode==0 ~ "minor",
mode==1 ~ "major"),
key_str = as.character(key),
key_group = str_replace_all(key_str, key_map))
Now, we have a fair understanding of the data and have processed it appropriately so let’s observe some trends!
The number of tracks released fluctuates a lot between 1920 and 1950, probably because of limited production capacity. From 1950 until 1999, around 2,000 tracks were released consistently each year. The number dropped by almost half in 2000-2001 and has been steadily increasing since 2004. Interestingly, the number of tracks shot up by a factor of about 1.5 in 2020. I am reining in my temptation to attach reasons to the high-level fluctuations in the number of tracks over the years, since Spotify is not the exclusive platform hosting all the music that gets created around the globe. However, the trends are certainly intriguing and demand further investigation.
The number of unique artists who released tracks each year shows an increasing trend across all years, including the years when the number of tracks was stable at around 2,000. This suggests increased accessibility, even for novice artists, to publish their songs on Spotify.
ggplotly(
data_prc %>%
group_by(year) %>%
summarise(tracks = n(),
artists = n_distinct(artists)) %>%
ungroup() %>%
ggplot(aes(x = year, group = 1)) + # y is mapped per layer below
geom_line(aes(y = tracks, color = "tracks")) +
geom_point(aes(y = tracks, color = "tracks")) +
geom_line(aes(y = artists, color = "artists")) +
geom_point(aes(y = artists, color = "artists")) +
scale_x_continuous(breaks = seq(1910, 2020, 10)) +
scale_color_manual(values = c("tracks" = "#8d52eb", "artists" = "#ec576c")) +
theme(axis.text.x = element_text(angle = 90),
legend.title = element_blank()) +
labs(y = "tracks vs artists")
)
#ggsave("./imgs/track_artist_trend.png")
The mean popularity has gone down since 2000 while the maximum popularity has been on the rise over the same period. Despite there being more tracks and more artists in recent years, not all of them gained higher popularity. 2020 recorded the highest maximum popularity score, 96 out of 100. This could be partly because the popularity calculation gives more weight to recent plays, which makes tracks released in recent years more popular. However, with the increased reach of the internet, more people are listening to tracks on Spotify than before, increasing the overall number of plays.
ggplotly(
data_prc %>%
group_by(year) %>%
summarise(mean_popularity = mean(popularity),
max_popularity = max(popularity)) %>%
ggplot(aes(x = year, group = 1)) +
geom_line(aes(y = mean_popularity, color = "mean_popularity")) +
geom_point(aes(y = mean_popularity, color = "mean_popularity")) +
geom_line(aes(y = max_popularity, color = "max_popularity")) +
geom_point(aes(y = max_popularity, color = "max_popularity")) +
scale_x_continuous(breaks = seq(1910, 2020, 10)) +
scale_color_manual(values = c("mean_popularity" = "#aaf6b1", "max_popularity" = "#019875")) +
labs(y = "mean and max popularity") +
theme(axis.text.x = element_text(angle = 90),
legend.title = element_blank())
)
#ggsave("./imgs/popularity_trend.png")
As with the popularity score, it will be interesting to check the trends in audio features over 100 years as well. You might have noticed that the audio features are on different scales. For example, loudness ranges from 0 (loudest) to -60 (quietest) while popularity ranges from 0 to 100. Therefore, to compare them on a common scale, we need to normalize their numerical values to a standard range, say 0 to 1.
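A minimal sketch of this min-max normalization on made-up values follows. The function name `rescale_01` and the sample vectors are hypothetical and used only for illustration. Each value is mapped to (x - min) / (max - min), so the smallest observed value becomes 0 and the largest becomes 1.

```r
# Min-max rescaling: maps the smallest value to 0 and the largest to 1
rescale_01 <- function(x) (x - min(x)) / (max(x) - min(x))

loudness_db <- c(-60, -30, -15, 0)   # made-up loudness values in dB
popularity  <- c(0, 25, 50, 100)     # made-up popularity scores

rescale_01(loudness_db)  # 0.00 0.50 0.75 1.00
rescale_01(popularity)   # 0.00 0.25 0.50 1.00
```

Two caveats with this approach: the result depends on the observed minimum and maximum rather than the theoretical range of a feature, and a column with a single constant value would produce a division by zero.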
The average acousticness has dropped drastically over the years. This could be because of the invention of more electronic instruments. The average energy of tracks has increased dramatically over the years, at almost the same time that acousticness started to fall. Could there be a correlation there?
The average loudness of the tracks has slightly increased, while the average valence has slightly decreased. Instrumentalness was higher in the earlier years, although it has seen an uptick in recent years.
rescale <- function(x) (x-min(x))/(max(x) - min(x))
scales_data_prc <- data_prc %>%
mutate(year = as.character(year)) %>%
mutate_if(is.numeric, ~rescale(.)) %>%
mutate(year = as.integer(year))
audio_features_2 <- c("acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence")
ggplotly(
scales_data_prc %>%
pivot_longer(cols = all_of(audio_features_2), names_to = "feature_name", values_to = "feature_value") %>%
group_by(year, feature_name) %>%
summarise(mean_feature_value = mean(feature_value)) %>%
ungroup() %>%
ggplot(aes(x = year, y = mean_feature_value, color = feature_name)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(1910, 2020, 10)) +
scale_color_brewer(palette = "Set3")
)
#ggsave("./imgs/audio_feature_trend.png")
We need to analyze trends for audio features that are either logical or categorical separately because average values for them would be difficult to comprehend.
There are more tracks set in minor mode in recent years. Generally speaking, minor mode tracks tend to have a sad/gloomy mood. This is in alignment with decreasing valence in the recent years.
ggplotly(
data_prc %>%
count(year, mode_type) %>%
group_by(year) %>%
mutate(perc_tracks = n/sum(n)) %>%
ungroup() %>%
ggplot(aes(x = year, y = perc_tracks, fill = mode_type, group = 1)) +
geom_col() +
scale_x_continuous(breaks = seq(1910, 2020, 10))
)
#ggsave("./imgs/mode_trend.png")
The percentage of tracks with explicit content has been increasing since the 1980s, with two peaks around 2000 and 2018. Some of the tracks with explicit content are not songs but bits from comedians. For example, the comedians Todd Glass and Blake Wexler added 44 tracks with explicit content and a mean speechiness of 0.84 in 2018.
ggplotly(
data_prc %>%
count(year, explicit) %>%
group_by(year) %>%
mutate(perc_tracks = n/sum(n)) %>%
ungroup() %>%
ggplot(aes(x = year, y = perc_tracks, fill = factor(explicit))) +
geom_col() +
scale_x_continuous(breaks = seq(1910, 2020, 10))
)
#ggsave("./imgs/explicit_trend.png")
The average track duration has fluctuated between 4 and 4.5 minutes since 1970. Spotify has collected 1.17 years’ worth of tracks over 100 years!
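The "years’ worth of tracks" figure is a unit conversion of the summed track durations. A small sketch of that arithmetic follows, assuming 365-day years. The helper `ms_to_years` and the sample durations are hypothetical.

```r
# Convert a total duration in milliseconds to years (365-day years assumed)
ms_to_years <- function(total_ms) {
  total_ms / (1000 * 60 * 60 * 24 * 365)
}

# Made-up durations for three tracks, in milliseconds
toy_duration_ms <- c(240000, 210000, 180000)

total_minutes <- sum(toy_duration_ms) / (1000 * 60)  # 10.5 minutes
total_years   <- ms_to_years(sum(toy_duration_ms))

# Sanity check: one 365-day year of milliseconds converts to exactly 1
ms_to_years(365 * 24 * 60 * 60 * 1000)
```

On the processed data, the same helper would be applied to the sum of the `duration_ms` column.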
ggplotly(
data_prc %>%
group_by(year) %>%
summarise(mean_dur = mean(duration_min)) %>%
ungroup() %>%
ggplot(aes(x = year, y = mean_dur)) +
geom_line(color = "#800000") + geom_point(color = "#800000") +
labs(y = "mean duration (mins)") +
scale_x_continuous(breaks = seq(1910, 2020, 10))
)
#ggsave("./imgs/duration_trend.png")
Who does not enjoy popularity? Let’s see if we can find the secret behind the most popular track on Spotify.
Two tracks released in 2020 share the highest popularity score of 96: positions by Ariana Grande and Mood (feat. iann dior) by 24kGoldn.
Both tracks are on the higher end of valence (happy mood), energy, and danceability. Both contain explicit content and have 0 instrumentalness. The track positions is set in a major mode, while Mood (feat. iann dior) is set in a minor mode.
audio_features <- c("acousticness", "danceability", "duration_min", "energy", "instrumentalness", "explicit", "liveness", "loudness", "key", "mode", "speechiness", "tempo", "valence")
ggplotly(
data_prc %>%
filter(popularity==max(popularity)) %>%
pivot_longer(cols = all_of(audio_features), names_to = "feature_name", values_to = "feature_value") %>%
ggplot(aes(x = name, y = feature_value, fill = name)) +
geom_col() +
facet_wrap(~feature_name, scales = "free_y") +
theme(axis.text.x = element_text(angle = 45))
)
#ggsave("./imgs/popular_track.png")
Can we conclude from these two tracks that having high energy, danceability, and explicit content is the key to higher popularity? We should probably analyze further by understanding the interplays of audio features with each other and with popularity to answer this question.
The graph below shows the Pearson correlation of each audio feature with every other audio feature. Note that mode and key are excluded from this matrix because they are effectively categorical variables. Colors closer to orange indicate a stronger positive correlation, colors closer to purple indicate little or no correlation, and colors closer to dark blue indicate negative correlation. Trivially, each audio feature is perfectly correlated with itself.
Energy is negatively correlated with acousticness which aligns with what we observed in the trends earlier. Energy increased at the same time acousticness decreased.
Loudness is highly correlated with energy which makes loudness negatively correlated with acousticness as well.
Danceability is positively correlated with valence of the track, not surprising!
Popularity is negatively correlated with acousticness, instrumentalness, and speechiness while positively correlated with loudness and energy.
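To make the structure behind such a heatmap concrete, here is a small sketch that computes a Pearson correlation matrix on made-up feature values and reshapes it into one row per feature pair. It uses tidyr instead of reshape2 purely for illustration. The data and the names `toy`, `cor_mat`, and `cor_long` are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up feature values, for illustration only
toy <- tibble(
  energy       = c(0.2, 0.4, 0.6, 0.8, 0.9),
  loudness     = c(-20, -14, -9, -6, -4),
  acousticness = c(0.9, 0.7, 0.5, 0.2, 0.1)
)

cor_mat <- cor(toy)   # cor() computes Pearson correlation by default

# Long format: one row per (feature_x, feature_y) pair, ready for geom_tile()
cor_long <- cor_mat %>%
  as.data.frame() %>%
  rownames_to_column("feature_x") %>%
  pivot_longer(-feature_x, names_to = "feature_y", values_to = "correlation")

cor_long
```

In this toy data, energy rises with loudness and falls with acousticness, so the corresponding entries are positive and negative respectively, and the diagonal entries are 1.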
ggplotly(
melt(cor(data_prc %>%
select(all_of(c(audio_features_2, "popularity"))))) %>% # all_of() makes the external vector explicit
ggplot(aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "#003366", high = "orange", mid = "purple") +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "", y = "")
)
#ggsave("./imgs/corr.png")
Another interesting factor affecting popularity is the presence of explicit content in tracks. Since explicit is a logical variable, plotting the popularity distribution for each of its values is a more appropriate way to analyze its impact on popularity. The violin chart below shows the density of the popularity distribution in the vertical direction, while the horizontal lines show the 25th, 50th, and 75th percentiles of the distribution. 75% of tracks with explicit content had a popularity of no more than 65, while 75% of tracks without explicit content had a popularity of no more than 42. This indicates that tracks with explicit content tend to be more popular.
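The percentile comparison can also be checked numerically. The sketch below computes sample quantiles per group on made-up data. The names `toy` and `pop_quantiles` are hypothetical. The quantile lines drawn inside a violin are derived from the estimated density, so they may differ slightly from sample quantiles like these.

```r
library(dplyr)

# Made-up popularity scores for non-explicit (0) and explicit (1) tracks
toy <- tibble(
  explicit   = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  popularity = c(10, 20, 30, 40, 50, 30, 40, 50, 60, 70)
)

pop_quantiles <- toy %>%
  group_by(explicit) %>%
  summarise(
    q25 = quantile(popularity, 0.25),
    q50 = quantile(popularity, 0.50),
    q75 = quantile(popularity, 0.75)
  )

pop_quantiles
```

In this toy data the 75th percentile is 40 for the non-explicit group and 60 for the explicit group, which mirrors the kind of gap described above.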
data_prc %>%
ggplot(aes(x = factor(explicit), y = popularity)) +
geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
labs(x = "explicit content")
#ggsave("./imgs/explicit_popularity.png")
Some of the correlations mentioned above are visualized together in the chart below, with linear trends fitted. Such correlation plots can be very useful for understanding the importance of each feature in predicting a variable such as popularity.
It seems that tracks with more energy, loudness, and explicit content, and with less instrumentalness and acousticness, tend to be more popular. This is not very far from our earlier conclusion drawn from the most popular tracks.
p_acoustic <- data_prc %>%
mutate(acoustic_bin = cut(acousticness, seq(0,1.1,0.0001), right = FALSE)) %>%
group_by(acoustic_bin) %>%
summarise(mean_popularity = mean(popularity),
mean_acoustic = mean(acousticness)) %>%
ungroup() %>%
ggplot(aes(x = mean_acoustic, y = mean_popularity)) +
geom_point(alpha = 0.2, size = 3) +
geom_smooth(method = "lm") +
labs(title = str_glue("correlation = {round(cor(data_prc$acousticness, data_prc$popularity),2)}"))
p_energy <- data_prc %>%
mutate(energy_bin = cut(energy, seq(0,1.1,0.0001), right = FALSE)) %>%
group_by(energy_bin) %>%
summarise(mean_popularity = mean(popularity),
mean_energy = mean(energy)) %>%
ungroup() %>%
ggplot(aes(x = mean_energy, y = mean_popularity)) +
geom_point(alpha = 0.2, size = 3) +
geom_smooth(method = "lm") +
labs(title = str_glue("correlation = {round(cor(data_prc$energy, data_prc$popularity),2)}"))
p_energy_loudness <- data_prc %>%
mutate(loudness_bin = cut(loudness, seq(0,-60,-0.01), right = FALSE)) %>%
group_by(loudness_bin) %>%
summarise(mean_energy = mean(energy),
mean_loudness = mean(loudness)) %>%
ungroup() %>%
ggplot(aes(x = mean_loudness, y = mean_energy)) +
geom_point(alpha = 0.2, size = 3) +
geom_smooth(method = "lm") +
labs(title = str_glue("correlation = {round(cor(data_prc$energy, data_prc$loudness),2)}"))
p_valence_dance <- data_prc %>%
mutate(dance_bin = cut(danceability, seq(0,1,0.0001), right = FALSE)) %>%
group_by(dance_bin) %>%
summarise(mean_valence = mean(valence),
mean_dance = mean(danceability)) %>%
ungroup() %>%
ggplot(aes(x = mean_dance, y = mean_valence)) +
geom_point(alpha = 0.2, size = 3) +
geom_smooth(method = "lm") +
labs(title = str_glue("correlation = {round(cor(data_prc$valence, data_prc$danceability),2)}"))
grid.arrange(p_acoustic, p_energy, p_energy_loudness, p_valence_dance, nrow = 2)
#ggsave("./imgs/linear_trend.png")
After visualizing trends and correlation of audio features, let’s find out which artists have been the most productive on Spotify over 100 years.
The chart below shows the 20 artists with the most tracks on Spotify. The shading represents the number of years between the release of their first and last tracks. The top 4 most productive artists, who lead by a wide margin, were each active for fewer than 30 years between 1920 and 1950. Tadeusz Dolega Mostowicz is actually a Polish writer whose books are posted chapter by chapter on Spotify. A higher number of active years may correlate with popularity, since such artists could release tracks consistently over many years and have their tracks played more over time. Ella Fitzgerald is one of the early artists (from the 1920s) whose last track was added in 1999, yet she ranks pretty high on the popularity spectrum (73). Quite incredible! Frank Sinatra has been active for the largest number of years (80) among the top 20 artists.
As a devout lover of old Indian songs, I was delighted to see a highly revered Indian artist, Lata Mangeshkar, right between The Beatles and Queen.
ggplotly(
data_prc %>%
group_by(artists) %>%
summarise(n_songs = n(),
first_activity = min(year),
last_activity = max(year)) %>%
ungroup() %>%
mutate(years_active = last_activity - first_activity + 1) %>%
arrange(desc(n_songs)) %>%
head(20) %>%
ggplot(aes(x = reorder(artists, n_songs), y = n_songs, fill = years_active)) +
geom_col() +
scale_fill_continuous(high = "#132B43", low = "#56B1F7") +
labs(x = "artist", y = "tracks") + # labels follow the aesthetics; coord_flip() swaps the drawn axes
coord_flip()
)
#ggsave("./imgs/productive.png")
Let’s quickly go through the audio features of tracks by a few well-known artists.
The most popular song by Lata Mangeshkar on Spotify is Aaj Phir Jeene Ki Tamanna Hai, released in 1965. The chart below shows that her tracks are at the high end of acousticness (not surprising). Danceability is in the medium range. Her tracks on Spotify are mostly at the higher end of valence. Since Spotify might have only a limited number of tracks from this prolific artist, it might be best not to conclude anything in particular.
# data_prc %>%
# filter(artists=="['Lata Mangeshkar']") %>%
# arrange(desc(popularity))
ggplotly(
scales_data_prc %>%
filter(artists=="['Lata Mangeshkar']") %>%
pivot_longer(all_of(audio_features_2), names_to = "feature_name", values_to = "feature_value") %>%
ggplot(aes(x = feature_name, y = feature_value, color = feature_name)) +
geom_jitter(size = 3, alpha = 0.4) +
theme(axis.text.x = element_text(angle = 90)) +
scale_color_brewer(palette = "Set3")
)
#ggsave("./imgs/lata_mangeshkar.png")
The Beatles: Come Together - Remastered 2009 is the most popular track, with a popularity score of 78
Queen: Don’t Stop Me Now - Remastered 2011 is the most popular track, with a popularity score of 73
Coldplay: Yellow is the most popular track, with a popularity score of 85
Below is their anatomy of audio features.
The Beatles and Queen are not too different from each other in terms of their audio feature profile. Coldplay is high on loudness, medium on danceability, and low on valence.
# data_prc %>%
# filter(artists=="['The Beatles']") %>%
# arrange(desc(popularity))
ggplotly(scales_data_prc %>%
filter(artists=="['The Beatles']") %>%
pivot_longer(all_of(audio_features_2), names_to = "feature_name", values_to = "feature_value") %>%
ggplot(aes(x = feature_name, y = feature_value, color = feature_name)) +
geom_jitter(size = 3, alpha = 0.5) +
theme(axis.text.x = element_text(angle = 90)) +
scale_color_brewer(palette = "Set3")+
labs(title = "The Beatles"))
#ggsave("./imgs/beatles.png")
# data_prc %>%
# filter(artists=="['Queen']") %>%
# arrange(desc(popularity))
ggplotly(scales_data_prc %>%
filter(artists=="['Queen']") %>%
pivot_longer(all_of(audio_features_2), names_to = "feature_name", values_to = "feature_value") %>%
ggplot(aes(x = feature_name, y = feature_value, color = feature_name)) +
geom_jitter(size = 3, alpha = 0.5) +
theme(axis.text.x = element_text(angle = 90)) +
scale_color_brewer(palette = "Set3") +
labs(title = "Queen"))
#ggsave("./imgs/queen.png")
# data_prc %>%
# filter(artists=="['Coldplay']") %>%
# arrange(desc(popularity))
ggplotly(scales_data_prc %>%
filter(artists=="['Coldplay']") %>%
pivot_longer(all_of(audio_features_2), names_to = "feature_name", values_to = "feature_value") %>%
ggplot(aes(x = feature_name, y = feature_value, color = feature_name)) +
geom_jitter(size = 3, alpha = 0.5) +
theme(axis.text.x = element_text(angle = 90)) +
scale_color_brewer(palette = "Set3") +
labs(title = "Coldplay"))
#ggsave("./imgs/coldplay.png")
One question that came to mind while exploring the data is what the most commonly used words across all tracks can tell us about the core inspiration of music.
Among the top 100 words across 100 years of track names, love and live are the most common, after excluding common words like a/an/the/you/me/I/mixed/remaster etc. and years. I wonder which live, the verb or the adjective (as in a live recording), has been used more often in track names. One could cross-check the keyword live against the liveness audio feature to get a better distinction, or it might require another blog post on sentiment analysis. For now, let’s hope it is live, the verb.
As a side note, I also checked the most commonly used word for each year from 1990 onward, and love came out on top for most of the years.
It is quite interesting to see words related to working out among the more common words in the track names of 2020.
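The word counts described above can be sketched roughly as follows. This is only an illustration on made-up track names, with a hypothetical and deliberately short stop-word list. It is not necessarily the exact procedure used for the article.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up track names for illustration
toy_names <- tibble(name = c(
  "Love Me Do - Remastered 2009",
  "Live and Let Die",
  "All You Need Is Love",
  "Love of My Life - Live"
))

# Hypothetical stop-word list; a real analysis would use a fuller one
stop_words_hypothetical <- c("a", "an", "the", "you", "me", "i", "my", "of",
                             "is", "and", "do", "all", "remastered", "remaster")

word_counts <- toy_names %>%
  mutate(word = str_split(str_to_lower(name), "[^a-z0-9']+")) %>%  # split on non-word characters
  unnest(word) %>%
  filter(word != "",
         !word %in% stop_words_hypothetical,
         !str_detect(word, "^[0-9]+$")) %>%   # drop years and other pure numbers
  count(word, sort = TRUE)

head(word_counts, 2)
```

In this toy input, love appears three times and live twice, so they top the list. Restricting the same pipeline to a single year would give the per-year counts.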
My goal of understanding music and artists from other genres was fairly well satisfied through this EDA. I hope some of your musical and analytical curiosity was satisfied through this article as well. I am certainly going to listen to tracks from all 20 most productive artists, especially Ella Fitzgerald, and to the most popular tracks ever recorded on Spotify, positions and Mood (feat. iann dior). I am also guilty of dancing to the most danceable song, Funky Cold Medina by Tone-Loc, while writing this article.
The data and code are freely available here. Poke around to find answers to your own questions about this data.
Have a lovely day!